Abstract

This paper introduces a new type of grammar 
learning algorithm, inspired by string edit distance
(Wagner and Fischer, 1974). The algorithm
takes a corpus of flat sentences as input
and returns a corpus of labelled, bracketed sen- 
tences. The method works on pairs of unstruc- 
tured sentences that have one or more words in 
common. When two sentences are divided into 
parts that are the same in both sentences and 
parts that are different, this information is used 
to find parts that are interchangeable. These 
parts are taken as possible constituents of the 
same type. After this alignment learning step, 
the selection learning step selects the most prob- 
able constituents from all possible constituents. 

This method was used to bootstrap structure
on the ATIS corpus (Marcus et al., 1993) and
on the OVIS¹ corpus (Bonnema et al., 1997).
While the results are encouraging (we obtained
up to 89.25 % non-crossing brackets precision),
this paper will point out some of the shortcomings
of our approach and will suggest possible
solutions.

1 Introduction 

Unsupervised learning of syntactic structure is 
one of the hardest problems in NLP. Although 
people are adept at learning grammatical struc- 
ture, it is difficult to model this process and 
therefore it is hard to make a computer learn 
structure. 

We do not claim that the algorithm described 
here models the human process of language 
learning. Instead, the algorithm should, given 
unstructured sentences, find the best structure. 
This means that the algorithm should assign
structure to sentences which is similar to the
structure people would give to sentences, but
not necessarily in the same time or space restrictions.

¹Openbaar Vervoer Informatie Systeem (OVIS)
stands for Public Transport Information System.

The algorithm consists of two phases. The 
first phase is a constituent generator, which gen- 
erates a motivated set of possible constituents 
by aligning sentences. The second phase re- 
stricts this set by selecting the best constituents 
from the set. 

The rest of this paper is organized as follows.
We first describe previous work in machine
learning of language structure and then
describe the Alignment-Based Learning (ABL)
algorithm. Next, we give some results of applying
the ABL algorithm to different corpora,
followed by a discussion of the algorithm
and future research.

2 Previous Work 

Learning methods can be grouped into super- 
vised and unsupervised methods. Supervised 
methods are initialised with structured input 
(i.e. structured sentences for grammar learning 
methods), while unsupervised methods learn by 
using unstructured data only. 

In practice, supervised methods outperform 
unsupervised methods, since they can adapt 
their output based on the structured examples 
in the initialisation phase whereas unsupervised 
methods cannot. However, it is worthwhile 
to investigate unsupervised grammar learning 
methods, since "the costs of annotation are prohibitively
time and expertise intensive, and the
resulting corpora may be too susceptible to restriction
to a particular domain, application, or
genre" (Kehler and Stolcke, 1999).

There have been several approaches to the un- 
supervised learning of syntactic structures. We 
will give a short overview here. 



Memory based learning (MBL) keeps track of
possible contexts and assigns word types based
on that information (Daelemans, 1995). Redington
et al. (1998) present a method that
bootstraps syntactic categories using distributional
information and Magerman and Marcus
(1990) describe a method that finds constituent
boundaries using mutual information values of
the part of speech n-grams within a sentence.

What is a family fare
What is the payload of an African Swallow

What is (a family fare)X
What is (the payload of an African Swallow)X

Figure 1: Example bootstrapping structure

Algorithms that use the minimum description 
length (MDL) principle build grammars that 
describe the input sentences using the minimal 
number of bits. This idea stems from information
theory. Examples of these systems can be
found in (Grünwald, 1994) and (de Marcken,
1996).



The system by Wolff (1982) performs a 
heuristic search while creating and merging 
symbols directed by an evaluation function. 
Chen (1995) presents a Bayesian grammar induction
method, which is followed by a post-pass
using the inside-outside algorithm (Baker,
1979; Lari and Young, 1990).



Most work described here cannot learn com- 
plex structures such as recursion, while other 
systems only use limited context to find con- 
stituents. However, the two phases in ABL 
are closely related to some previous work. 
The alignment learning phase is effectively a 
compression technique comparable to MDL or 
Bayesian grammar induction methods. ABL 
remembers all possible constituents, building 
a search space. The selection learning phase 
searches this space, directed by a probabilistic 
evaluation function. 

3 Algorithm 

We will describe an algorithm that learns structure
using a corpus of plain (unstructured) sentences.
It does not need a structured training
set to initialize; all structural information
is gathered from the unstructured sentences.

The output of the algorithm is a labelled, 
bracketed version of the input corpus. Although 
the algorithm does not generate a (context-free) 
grammar, it is trivial to deduce one from the 
structured corpus. 
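To illustrate why deducing a grammar is straightforward, the following Python sketch reads context-free rules off a labelled, bracketed sentence. The nested-tuple tree encoding, the function name `extract_rules` and the top-level label S are assumptions made for this example only; they are not prescribed by the algorithm.

```python
from collections import Counter

def extract_rules(tree, rules=None):
    """Collect one context-free rule per labelled bracket in `tree`.

    A tree is either a word (str) or a pair (label, children), where
    children is a list of trees. This encoding is hypothetical.
    """
    if rules is None:
        rules = Counter()
    if isinstance(tree, str):          # a word contributes no rule
        return rules
    label, children = tree
    # The right-hand side lists the labels of sub-brackets and the
    # words themselves for terminals.
    rhs = tuple(child if isinstance(child, str) else child[0]
                for child in children)
    rules[(label, rhs)] += 1
    for child in children:
        extract_rules(child, rules)
    return rules

# "What is (a family fare)X" with a hypothetical top-level label S.
sentence = ("S", ["What", "is", ("X", ["a", "family", "fare"])])
rules = extract_rules(sentence)
for (lhs, rhs), count in sorted(rules.items()):
    print(lhs, "->", " ".join(rhs), f"[{count}]")
```

Running the function over every sentence of the structured corpus with a shared counter would give rule counts for the whole corpus.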

The algorithm builds on Harris's idea (1951)
that constituents of the same type
can be replaced by each other. Consider the sentences
as shown in figure 1.² The constituents a
family fare and the payload of an African Swallow
both have the same syntactic type (they
are both NPs), so they can be replaced by each
other. This means that when the constituent in
the first sentence is replaced by the constituent
in the second sentence, the result is a valid sentence
in the language; it is the second sentence.

For each sentence s1 in the corpus:
  For every other sentence s2 in the corpus:
    Align s1 to s2
    Find the identical and distinct parts
      between s1 and s2
    Assign non-terminals to the constituents
      (i.e. distinct parts of s1 and s2)

Figure 2: Alignment learning algorithm

The main goal of the algorithm is to estab- 
lish that a family fare and the payload of an 
African Swallow are constituents and have the 
same type. This is done by reversing Harris's 
idea: if (a group of) words can be replaced by 
each other, they are constituents and have the 
same type. So the algorithm now has to find 
groups of words that can be replaced by each 
other and after replacement still generate valid 
sentences. 

The algorithm consists of two steps: 

1. Alignment Learning 

2. Selection Learning 

3.1 Alignment Learning 

The model learns by comparing all sentences 
in the input corpus to each other in pairs. An 
overview of the algorithm can be found in figure 2.

Aligning sentences results in "linking" identical
words in the sentences. Adjacent linked
words are then grouped. This process reveals
the groups of identical words, but it also uncovers
the groups of distinct words in the sentences.
In figure 1, What is is the identical part of the
sentences and a family fare and the payload of
an African Swallow are the distinct parts. The
distinct parts are interchangeable, so they are
determined to be constituents of the same type.

²All sentences in the examples can be found in the
ATIS corpus.

from ()1 San Francisco (to Dallas)2
from (Dallas to)1 San Francisco ()2

from (San Francisco to)1 Dallas ()2
from ()1 Dallas (to San Francisco)2

from (San Francisco)1 to (Dallas)2
from (Dallas)1 to (San Francisco)2

Figure 3: Ambiguous alignments

We will now explain the steps in the align- 
ment learning phase in more detail. 

3.1.1 Edit Distance 

To find the identical word groups in the sentences,
we use the edit distance algorithm by
Wagner and Fischer (1974), which finds the
minimum number of edit operations (insertion,
deletion and substitution) to change one sentence
into the other. Identical words in the sentences
can be found at places where no edit operation
was applied.
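The linking step can be sketched in Python as follows. This is a minimal dynamic-programming implementation of edit distance over words with a trace-back that records the positions where no edit operation was applied. Unit costs for all three operations and the function name `align` are assumptions of this sketch; the cost function actually used is not specified here.

```python
def align(s1, s2):
    """Return (distance, links) for two lists of words.

    links holds index pairs (i, j) with s1[i] == s2[j] at which the
    chosen minimum-cost edit script applies no operation.
    """
    n, m = len(s1), len(s2)
    # d[i][j] = edit distance between s1[:i] and s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
    # Trace back one optimal path and record the matches.
    links, i, j = [], n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and d[i][j] == d[i - 1][j - 1]:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j - 1] + 1:        # substitution
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:            # deletion
            i -= 1
        else:                                       # insertion
            j -= 1
    links.reverse()
    return d[n][m], links

distance, links = align("What is a family fare".split(),
                        "What is the payload of an African Swallow".split())
print(distance, links)
```

For the two sentences of figure 1, only What and is are linked; the remaining words are covered by substitutions and insertions.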

The instantiation of the algorithm that finds
the longest common subsequence in two sentences
sometimes "links" words that are too
far apart. In figure 3, when, besides the occurrences
of from, the occurrences of San Francisco
or Dallas are linked, this results in unintended
constituents. We would rather have the model
link to, resulting in a structure in which the noun
phrases are correctly grouped with the same type.

Linking San Francisco or Dallas results in 
constituents that vary widely in size. This stems 
from the large distance between the linked 
words in the first sentence and in the second 
sentence. This type of alignment can be ruled 
out by biasing the cost function using distances 
between words. 

3.1.2 Grouping 

An edit distance algorithm links identical words 
in two sentences. When adjacent words are 
linked in both sentences, they can be grouped. 
A group like this is a part of a sentence that can
also be found in the other sentence. (In figure 1,
What is is a group like this.)

The rest of the sentences can also be grouped. 
The words in these groups are words that are 
distinct in the two sentences. When all of these
groups in sentence one are replaced by
the respective groups of sentence two, sentence
two is generated. (a family fare and the payload
of an African Swallow are groups of this type
in figure 1.) Each pair of these distinct
groups consists of possible constituents of the
same type.³

As can be seen in figure 3, it is possible that
empty groups can be learned.
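The grouping step can be illustrated with a short Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in aligner, which is an assumption of this example: its matching heuristic is not the edit distance algorithm described above, although it yields the same groups for the sentences shown. The function name `group` is likewise hypothetical.

```python
from difflib import SequenceMatcher

def group(s1, s2):
    """Split two lists of words into identical and distinct groups.

    Returns a list of (kind, part1, part2) triples, where kind is
    "identical" or "distinct". A distinct part may be empty.
    """
    matcher = SequenceMatcher(a=s1, b=s2, autojunk=False)
    groups = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        kind = "identical" if tag == "equal" else "distinct"
        groups.append((kind, s1[i1:i2], s2[j1:j2]))
    return groups

fig1 = group("What is a family fare".split(),
             "What is the payload of an African Swallow".split())
fig3 = group("from San Francisco to Dallas".split(),
             "from Dallas to San Francisco".split())
for kind, part1, part2 in fig1 + fig3:
    print(kind, part1, part2)
```

Each "distinct" triple is a pair of possible constituents of the same type. For the sentences of figure 3 this aligner happens to link San Francisco, which reproduces the first alignment in that figure, including its empty groups.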

3.1.3 Existing Constituents 

At some point it may be possible that the model 
learns a constituent that was already stored. 
This may happen when a new sentence is com- 
pared to a sentence in the partially structured 
corpus. In this case, no new type is introduced,
but the constituent in the new sentence gets the
same type as the constituent in the sentence in
the partially structured corpus.

It may even be the case that a partially struc- 
tured sentence is compared to another partially 
structured sentence. This occurs when a sen- 
tence that contains some structure, which was 
learned by comparing to a sentence in the par- 
tially structured corpus, is compared to an- 
other (partially structured) sentence. When 
the comparison of these two sentences yields 
a constituent that was already present in both 
sentences, the types of these constituents are 
merged. All constituents of these types are up- 
dated, so they have the same type. 
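One way to keep merged types consistent is sketched below in Python. It uses a union-find table, so that merging two types implicitly updates every constituent carrying either of them. This data structure, the class name `TypeTable` and its methods are assumptions of the sketch, not a description of how the types are actually stored.

```python
class TypeTable:
    """Track which non-terminal types have been merged (union-find)."""

    def __init__(self):
        self.parent = {}
        self.next_type = 1   # natural numbers denote types

    def new_type(self):
        """Introduce a fresh type and return its number."""
        t = self.next_type
        self.next_type += 1
        self.parent[t] = t
        return t

    def find(self, t):
        """Return the representative type of t, compressing the path."""
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]
            t = self.parent[t]
        return t

    def merge(self, t1, t2):
        """Merge two types; the lower number becomes the representative."""
        r1, r2 = self.find(t1), self.find(t2)
        if r1 != r2:
            self.parent[max(r1, r2)] = min(r1, r2)
        return min(r1, r2)

types = TypeTable()
a, b, c = types.new_type(), types.new_type(), types.new_type()
types.merge(b, c)        # constituents of type 3 now count as type 2
types.merge(a, c)        # and all three types collapse into type 1
print(types.find(a), types.find(b), types.find(c))
```

A constituent stores the type number it was given when learned, and `find` is called whenever its current type is needed.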

By merging types of constituents we make the
assumption that constituents in a certain context
can only have one type. In section 5.2 we
discuss the implications of this assumption and
propose an alternative approach.

3.2 Selection Learning 

The first step in the algorithm may at some
point generate constituents that overlap with
other constituents. In figure 4, Give me all
flights from Dallas to Boston receives two overlapping
structures. One constituent is learned
by comparing against Book Delta 128 from Dallas
to Boston and the other (overlapping) constituent
is found by aligning with Give me help
on classes.

³Since the algorithm does not know any (linguistic)
names for the types, the algorithm chooses natural
numbers to denote different types.

( Book Delta 128 ) from Dallas to Boston
( Give me [ all flights ) from Dallas to Boston ]
Give me [ help on classes ]

Figure 4: Overlapping constituents

The solution to this problem has to do with 
selecting the correct constituents (or at least 
the better constituents) out of the possible con- 
stituents. Selecting constituents can be done in 
several different ways. 

ABL:incr Assume that the first constituent
learned is the correct one. This means that
when a new constituent overlaps with older
constituents, it can be ignored (i.e. it is
not stored in the corpus).

ABL:leaf The model computes the probability
of a constituent by counting the number of
times the particular words of the constituent
have occurred in the learned text as a constituent,
normalized by the total number of
constituents.

$$P_{leaf}(c) = \frac{|\{c' \in C : yield(c') = yield(c)\}|}{|C|}$$

where C is the entire set of constituents.

ABL:branch In addition to the words of the
sentence delimited by the constituent, the
model computes the probability based on the
part of the sentence delimited by the words
of the constituent and its non-terminal (i.e.
a normalised probability of ABL:leaf).

$$P_{branch}(c \mid root(c) = r) = \frac{|\{c' \in C : yield(c') = yield(c) \wedge root(c') = r\}|}{|\{c'' \in C : root(c'') = r\}|}$$

The first method is non-probabilistic and may 
be applied every time a constituent is found that 
overlaps with a known constituent (i.e. while 
learning). 
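The incremental selection rule can be sketched in a few lines of Python. Constituents are represented as half-open word spans `(start, end)`, and "overlap" is read as crossing brackets, so nested constituents are allowed; both choices, and the function names, are assumptions of this sketch.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def add_incremental(stored, new_span):
    """Store new_span unless it crosses an already stored span."""
    if any(crosses(new_span, old) for old in stored):
        return False          # the older constituent is kept
    stored.append(new_span)
    return True

# Give me all flights from Dallas to Boston (words 0..7)
stored = [(0, 4)]                            # (Give me all flights)
rejected = add_incremental(stored, (2, 8))   # all flights ... Boston
accepted = add_incremental(stored, (2, 4))   # nested: all flights
print(stored, rejected, accepted)
```

Because the outcome depends on which constituent is stored first, the result of this method depends on the order of the sentences in the corpus.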

The two other methods are probabilistic. The
model computes the probability of the constituents
and then uses that probability to select
constituents with the highest probability. These
methods are applied after the alignment learning
phase, since more specific information (in
the form of better counts) can be found at that
time.
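The two probability estimates can be computed from counts as in the following Python sketch. Constituents are represented as `(yield, type)` pairs collected in a list; this representation, the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter

def leaf_probability(constituents):
    """P_leaf(c): share of all constituents that have the same yield."""
    yields = Counter(words for words, _ in constituents)
    total = len(constituents)
    return {words: n / total for words, n in yields.items()}

def branch_probability(constituents):
    """P_branch(c | root(c) = r): the same count, normalised per type r."""
    pairs = Counter(constituents)
    roots = Counter(root for _, root in constituents)
    return {(words, root): n / roots[root]
            for (words, root), n in pairs.items()}

# Toy list of learned constituents; the data are invented.
C = [(("a", "family", "fare"), 1),
     (("a", "family", "fare"), 1),
     (("the", "table"), 1),
     (("a", "family", "fare"), 2)]

p_leaf = leaf_probability(C)
p_branch = branch_probability(C)
print(p_leaf[("a", "family", "fare")])
print(p_branch[(("a", "family", "fare"), 1)])
```

In this toy set the yield a family fare has a leaf probability of 3/4, while its branch probability is 2/3 under type 1 and 1 under type 2, which shows how the type refines the estimate.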

In section 4 we will evaluate all three methods
on the ATIS and OVIS corpora.

3.2.1 Viterbi 

Since more than just two constituents can overlap,
all possible combinations of overlapping
constituents should be considered when computing
the best combination of constituents. The
probability of a combination is the product of the
probabilities of the separate constituents, as in
SCFGs (cf. (Booth, 1969)). A Viterbi style
algorithm optimization (Viterbi, 1967) is used to
efficiently select the best combination of constituents.

When computing the probability of a combination
of constituents, multiplying the separate
probabilities of the constituents biases towards
a low number of constituents. Therefore,
we compute the probability of a set of constituents
using a normalized version, the geometric
mean⁴, rather than its product (Caraballo
and Charniak, 1998).
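The bias of the plain product, and how the geometric mean removes it, can be seen in a small worked example in Python. The probability values are invented for illustration.

```python
import math

def product(probs):
    """Probability of a combination as the product of its parts."""
    return math.prod(probs)

def geometric_mean(probs):
    """n-th root of the product, computed in log space."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

one = [0.5]          # a combination with a single constituent
two = [0.6, 0.6]     # a combination with two more probable constituents

print(product(one), product(two))
print(geometric_mean(one), geometric_mean(two))
```

Under the product the single constituent wins (0.5 against 0.36) although each constituent in the second combination is more probable; under the geometric mean the second combination scores higher.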



4 Results 

The three different ABL algorithms and two 
baseline systems have been tested on the ATIS 
and OVIS corpora. 

The ATIS corpus from the Penn Treebank 
consists of 716 sentences containing 11,777 con- 
stituents. The larger OVIS corpus is a Dutch 
corpus containing sentences on travel informa- 
tion. It consists of exactly 10,000 sentences. We 
have removed all sentences containing only one 
word, resulting in a corpus of 6,797 sentences 
and 48,562 constituents. 

The sentences of the corpora are stripped 
of their structures. These plain sentences are 
used in the learning algorithms and the result- 
ing structure is compared to the structure of the 
original corpus. 

All ABL methods are tested ten times. The
ABL:incr method is applied to random orders of
the input corpus. The probabilistic ABL methods
select constituents at random when different
combinations of constituents have the same
probability. The results in table 1 show the
mean and standard deviations (between brackets).

⁴The geometric mean of a set of constituents
$c_1, \ldots, c_n$ is $P(c_1 \wedge \ldots \wedge c_n) = \sqrt[n]{\prod_{i=1}^{n} P(c_i)}$.

                          ATIS                                        OVIS
            NCBP          NCBR          ZCS           NCBP          NCBR          ZCS
LEFT        32.60         76.82          1.12         51.23         73.17         25.22
RIGHT       82.70         92.91         38.83         75.85         86.66         48.08
ABL:INCR    83.24 (1.17)  87.21 (0.67)  18.56 (2.32)  88.71 (0.79)  84.36 (1.10)  45.11 (3.22)
ABL:LEAF    81.42 (0.11)  86.27 (0.06)  21.63 (0.50)  85.32 (0.02)  79.96 (0.03)  30.87 (0.09)
ABL:BRANCH  85.31 (0.01)  89.31 (0.01)  29.75 (0.00)  89.25 (0.00)  85.04 (0.00)  42.20 (0.01)

Table 1: Results of the ATIS and OVIS corpora

The two baseline systems, left and right, only
build left and right branching trees respectively.

Three metrics have been computed. NCBP
stands for Non-Crossing Brackets Precision,
which denotes the percentage of learned constituents
that do not overlap with any constituents
in the original corpus. NCBR is the
Non-Crossing Brackets Recall and shows the
percentage of constituents in the original corpus
that do not overlap with any constituents
in the learned corpus. Finally, ZCS stands for
Zero-Crossing Sentences and represents the percentage
of sentences that do not have any overlapping
constituents.
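The three metrics can be sketched in Python as follows. Constituents are half-open word spans `(start, end)`, "overlap" is read as crossing brackets, and a sentence counts as zero-crossing when none of its learned constituents crosses a constituent of the original; these readings and all names are assumptions of the sketch, and the example spans are invented.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def non_crossing(spans, others):
    """Count the spans that cross none of the spans in others."""
    return sum(1 for s in spans if not any(crosses(s, o) for o in others))

def evaluate(learned, gold):
    """Return (NCBP, NCBR, ZCS) as percentages.

    learned and gold are parallel lists holding one list of spans per
    sentence; both are assumed to be non-empty.
    """
    pairs = list(zip(learned, gold))
    ok_learned = sum(non_crossing(lrn, gld) for lrn, gld in pairs)
    ok_gold = sum(non_crossing(gld, lrn) for lrn, gld in pairs)
    n_learned = sum(len(lrn) for lrn in learned)
    n_gold = sum(len(gld) for gld in gold)
    zero = sum(1 for lrn, gld in pairs
               if non_crossing(lrn, gld) == len(lrn))
    return (100.0 * ok_learned / n_learned,
            100.0 * ok_gold / n_gold,
            100.0 * zero / len(pairs))

learned = [[(2, 8), (0, 2)], [(0, 2)]]
gold = [[(0, 4), (4, 8)], [(0, 2), (0, 3)]]
ncbp, ncbr, zcs = evaluate(learned, gold)
print(round(ncbp, 2), ncbr, zcs)
```

In the example, the learned span (2, 8) crosses the original span (0, 4), so two of three learned constituents and three of four original constituents are non-crossing, and one of the two sentences is free of crossings.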

4.1 Evaluation 

The incr model performs quite well considering
the fact that it cannot recover from incorrect
constituents, with a precision and recall of over
80 %. The order of the sentences however is
quite important, since the standard deviation
of the incr model is quite high (especially with
the ZCS, reaching 3.22 % on the OVIS corpus).

We expected the probabilistic methods to
perform better, but the leaf model performs
slightly worse. The ZCS, however, is somewhat
better, resulting in 21.63 % on the ATIS corpus.
Furthermore, the standard deviations of
the leaf model (and of the branch model) are
close to 0 %. The statistical methods generate
more precise results.

The branch model clearly outperforms all
other models. Using more specific statistics
generates better results.

Although the results on the ATIS corpus and
OVIS corpus differ, the conclusions that can be
reached are similar.



4.2 ABL Compared to Other Methods 

It is difficult to compare the results of the ABL
model against other methods, since often different
corpora or metrics are used. The method
described by Pereira and Schabes (1992)
comes reasonably close to ours. Their unsupervised
method learns structure on plain sentences
from the ATIS corpus resulting in 37.35 % precision,
while the unsupervised ABL significantly
outperforms this method, reaching 85.31 % precision.
Only their supervised version results in
a slightly higher precision of 90.36 %.

The system that simply builds right branching
structures results in 82.70 % precision and
92.91 % recall on the ATIS corpus, where ABL
got 85.31 % and 89.31 %. This was expected,
since English is a right branching language; a
left branching system performed much worse
(32.60 % precision and 76.82 % recall). Conversely,
right branching would not do very well
on a Japanese corpus (a left branching language).
Since ABL does not have a preference
for direction built in, we expect ABL to perform
similarly on a Japanese corpus.

5 Discussion and Future Extensions 
5.1 Recursion 

All ABL methods described here can learn recursive
structures, and such structures have been found when
applying ABL to the ATIS and OVIS corpora.
As can be seen in figure 5, the learned recursive
structure is similar to the original structure.
Some structure has been removed to make
it easier to see where the recursion occurs.

Roughly, recursive structures are built in two
steps. First, the algorithm generates the structure
with different non-terminals. Then, the
two non-terminals are merged as described in
section 3.1.3. The merging of the non-terminals
may occur anywhere in the corpus, since all
merged non-terminals are updated.



learned   Please explain the (field FLT DAY in the (table)18)18
original  Please explain (the field FLT DAY in (the table)NP)NP

learned   Explain classes QW and (QX and (Y)52)52
original  Explain classes ((QW)NP and (QX)NP and (Y)NP)NP

Figure 5: Recursive structures learned in the ATIS corpus



Show me the ( morning )x flights 
Show me the ( nonstop )x flights 

Figure 6: Wrong syntactic type 

5.2 Wrong Syntactic Type 



In section 3.1.3 we made the assumption that a
constituent in a certain context can only have
one type. This assumption introduces some
problems.

The sentence John likes visiting relatives il- 
lustrates such a problem. The constituent vis- 
iting relatives can be a noun phrase or a verb 
phrase. 

Another problem is illustrated in figure 6.
When applying the ABL learning algorithm to
these sentences, it will determine that morning
and nonstop are of the same type. Unfortunately,
morning is a noun, while nonstop is an
adverb.⁵

A future extension will not only look at the 
type of the constituents, but also at the con- 
text of the constituents. In the example, the 
constituent morning may also take the place of 
a subject position in other sentences, but the 
constituent nonstop never will. This informa- 
tion can be used to determine when to merge 
constituent types, effectively loosening the as- 
sumption. 

5.3 Weakening Exact Match 

When the ABL algorithms try to learn with two 
completely distinct sentences, nothing can be 
learned. If we weaken the exact match between 
words in the alignment step of the algorithm, it 
is possible to learn structure even with distinct 
sentences. 

Instead of linking exactly matching words,
the algorithm should match words that are
equivalent. An obvious way of implementing
this is by making use of equivalence classes. (See
for example (Redington et al., 1998).) The idea
behind equivalence classes is that words which
are closely related are grouped together.

⁵Harris's implication does hold in these sentences.
nonstop can also be replaced by for example cheap (another
adverb) and morning can be replaced by evening
(another noun).

A big advantage of equivalence classes is that 
they can be learned in an unsupervised way, so 
the resulting algorithm remains unsupervised. 

Words that are in the same equivalence class 
are said to be sufficiently equivalent, so the 
alignment algorithm may assume they are sim- 
ilar and may thus link them. Now sentences 
that do not have words in common, but do have 
words in the same equivalence class in common, 
can be used to learn structure. 
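A weakened match of this kind can be sketched in Python by aligning class labels instead of words. The class table, the example sentences and the function names are invented for illustration, and `difflib.SequenceMatcher` again serves as a stand-in aligner rather than the edit distance algorithm of section 3.1.1.

```python
from difflib import SequenceMatcher

# Hypothetical equivalence classes; in practice these would be learned
# from the corpus in an unsupervised way.
WORD_CLASS = {"Dallas": "CITY", "Boston": "CITY",
              "morning": "TIME", "evening": "TIME"}

def class_of(word):
    """Map a word to its class, or to itself if it has none."""
    return WORD_CLASS.get(word, word)

def link_equivalent(s1, s2):
    """Return index pairs of words linked via their equivalence class."""
    c1 = [class_of(w) for w in s1]
    c2 = [class_of(w) for w in s2]
    matcher = SequenceMatcher(a=c1, b=c2, autojunk=False)
    links = []
    for i, j, size in matcher.get_matching_blocks():
        links.extend((i + k, j + k) for k in range(size))
    return links

s1 = ["leave", "Dallas", "tomorrow", "morning"]
s2 = ["depart", "Boston", "this", "evening"]
links = link_equivalent(s1, s2)
print(links)
```

The two sentences share no words, so exact matching links nothing, but the class labels link Dallas with Boston and morning with evening, which leaves leave/depart and tomorrow/this as distinct groups.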

When using equivalence classes, more con- 
stituents are learned and more terminals in con- 
stituents may be seen as similar (according to 
the equivalence classes). This results in a much 
richer structured corpus. 

5.4 Alternative Statistics 

At the moment we have tested two different
ways of computing the probability of a constituent:
ABL:leaf which computes the probability
of the occurrence of the terminals in a
constituent, and ABL:branch which computes
the probability of the occurrence of the terminals
together with the root non-terminal in a
constituent, based on the learned corpus.

Of course, other models can be implemented. 
One interesting possibility takes a DOP-like approach
(Bod, 1998), which also takes into account
the inner structure of the constituents.

6 Conclusion 

We have introduced a new grammar learning al- 
gorithm based on comparing and aligning plain 
sentences; neither pre-labelled or bracketed sen- 
tences, nor pre-tagged sentences are used. It 
uses distinctions between sentences to find pos- 
sible constituents and afterwards selects the 
most probable ones. The output of the algo- 
rithm is a structured version of the corpus. 

By taking entire sentences into account, the
context used by the model is not limited by window
size; instead, arbitrarily large contexts are
used. Furthermore, the model has the ability to
learn recursion.

Three different instances of the algorithm
have been applied to two corpora of different
size, the ATIS corpus (716 sentences) and
the OVIS corpus (6,797 sentences), generating
promising results. Although the OVIS corpus
is almost ten times the size of the ATIS cor- 
pus, these corpora describe a small conceptual 
domain. We plan to apply the algorithms to 
larger domain corpora in the near future. 